BalanceSheet of companies is given as data.
Objective: A business user would like to get insights into the financial health of companies who are EDB’s customers. They are worried about the risk of these companies being unable to complete their projects due to financial reasons and wish to implement more stringent financial checks and controls on riskier companies.
To measure company's fianncial health, several fiancial metrics are considered.
(1) Liquidity : Liquidity is the amount of cash and easily-convertible-to-cash assets a company owns to manage its short-term debt obligations. Before a company can prosper in the long term, it must first be able to survive in the short term.
The two most common metrics used to measure liquidity are the current ratio and the quick ratio.
Quick ratio is present in our dataset. The quick ratio measures a company's capacity to pay its current liabilities without needing to sell its inventory or obtain additional financing.
Quick Ratio = "Quick Assets" / Current Liabilities
where Quick Assets = Cash + Cash Equivalents + Marketable Securities + Net Accounts Receivable
Quick ratio below 1 shows higher risk.
(2) Solvency: a company's ability to meet its debt obligations on an ongoing basis, not just over the short term.
debt-to-equity (D/E) ratio is generally a solid indicator of a company's long-term sustainability.
D/E ratio = Total Liabilites / Total shareholders Equity
(3) Operating Efficiency : Operating margin is one of the best indicators of efficiency. This metric considers a company's basic operational profit margin after deducting the variable costs of producing and marketing the company's products or services
(4) Profitability : Companies can survive for years without being profitable, operating on the goodwill of creditors and investors. But to survive in the long run, a company must eventually attain and maintain profitability. A good metric for evaluating profitability is net margin, the ratio of net profits to total revenues.
(Source: Investopedia.com)
# Loading the libraries
# Basic Libraries
import numpy as np
from numpy.random import seed
import pandas as pd
# Univariate EDA
from pandas_profiling import ProfileReport
# Data Vsiualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.offline as py
import plotly.graph_objs as go
import plotly.tools as tls
py.init_notebook_mode(connected=True)
import warnings
warnings.filterwarnings('ignore')
# # Modelling
from sklearn.metrics import roc_auc_score, f1_score, precision_recall_curve, confusion_matrix
from sklearn.model_selection import train_test_split,StratifiedKFold,cross_val_score
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
import shap
SEED = 7
np.random.seed(SEED)
pd.set_option('display.float_format', lambda x: '%.6f' % x)
pd.options.display.max_columns = None
pd.options.display.max_rows = None
df = pd.read_excel('Take Home Assignment Data.xlsx')
df.drop(['Unnamed: 0'], axis=1, inplace=True)
def show_data_info(data, data_types):
'''
Discription: Shows Data Information
'''
data.info()
print("\n")
for dtype in data_types :
data_type = data.select_dtypes(include=dtype).dtypes
if len(data_type) > 0 :
print(str(len(data_type))+" "+dtype+" Features\n"+str(data_type)+"\n" )
# Display Data Information of "train" :
data_types_list = ["float32","float64","int32","int64","object","category","datetime64[ns]"]
show_data_info(df, data_types_list)
<class 'pandas.core.frame.DataFrame'> RangeIndex: 160 entries, 0 to 159 Data columns (total 30 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Financial Year 160 non-null int64 1 Description Of Presentation Currency 160 non-null object 2 Assets 160 non-null float64 3 Goodwill 8 non-null float64 4 Intangible Assets Other Than Goodwill 62 non-null float64 5 Liabilities 160 non-null float64 6 Cash And Bank Balances 154 non-null float64 7 Trade And Other Receivables Current 154 non-null float64 8 Current Assets 160 non-null float64 9 Current Liabilities 160 non-null float64 10 Other Current Financial Assets 11 non-null float64 11 Other Noncurrent Financial Assets 14 non-null float64 12 Other Current Financial Liabilities 51 non-null float64 13 Other Noncurrent Financial Liabilities 8 non-null float64 14 Cash Flows From Used In Operating Activities 160 non-null float64 15 Current Loans And Borrowings 64 non-null float64 16 Noncurrent Loans And Borrowings 49 non-null float64 17 Profit Loss 155 non-null float64 18 Profit Loss Before Taxation 160 non-null float64 19 Finance Costs 126 non-null float64 20 Revenue 160 non-null float64 21 Net Tangible Assets margin 160 non-null float64 22 Operating cash flow rati 160 non-null float64 23 Debt ratio 160 non-null float64 24 NPBT 160 non-null float64 25 Unnamed: 25 0 non-null float64 26 Quick Ratio 160 non-null float64 27 Unnamed: 27 0 non-null float64 28 Interest coverage ratio 160 non-null float64 29 Risk 160 non-null object dtypes: float64(27), int64(1), object(2) memory usage: 37.6+ KB 27 float64 Features Assets float64 Goodwill float64 Intangible Assets Other Than Goodwill float64 Liabilities float64 Cash And Bank Balances float64 Trade And Other Receivables Current float64 Current Assets float64 Current Liabilities float64 Other Current Financial Assets float64 Other Noncurrent Financial Assets float64 Other Current Financial Liabilities float64 Other Noncurrent Financial Liabilities float64 Cash Flows From Used In Operating Activities float64 Current Loans And Borrowings float64 Noncurrent Loans And Borrowings float64 Profit Loss float64 Profit Loss Before Taxation float64 Finance Costs float64 Revenue float64 Net Tangible Assets margin float64 Operating cash flow rati float64 Debt ratio float64 NPBT float64 Unnamed: 25 float64 Quick Ratio float64 Unnamed: 27 float64 Interest coverage ratio float64 dtype: object 1 int64 Features Financial Year int64 dtype: object 2 object Features Description Of Presentation Currency object Risk object dtype: object
def show_top_bottom_rows(df, head_rows, tail_rows):
'''
Description: Show specified top and bottom rows of dataframe
'''
display('Dataframe head and tail: ')
display(df.head(head_rows).append(df.tail(tail_rows)))
show_top_bottom_rows(df, head_rows=3, tail_rows=2)
'Dataframe head and tail: '
| Financial Year | Description Of Presentation Currency | Assets | Goodwill | Intangible Assets Other Than Goodwill | Liabilities | Cash And Bank Balances | Trade And Other Receivables Current | Current Assets | Current Liabilities | Other Current Financial Assets | Other Noncurrent Financial Assets | Other Current Financial Liabilities | Other Noncurrent Financial Liabilities | Cash Flows From Used In Operating Activities | Current Loans And Borrowings | Noncurrent Loans And Borrowings | Profit Loss | Profit Loss Before Taxation | Finance Costs | Revenue | Net Tangible Assets margin | Operating cash flow rati | Debt ratio | NPBT | Unnamed: 25 | Quick Ratio | Unnamed: 27 | Interest coverage ratio | Risk | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019 | SGD | 0.003707 | 0.255638 | NaN | 0.007336 | 0.002338 | 0.024389 | 0.020782 | 0.014719 | NaN | NaN | NaN | NaN | 0.471081 | 0.007778 | NaN | 0.133511 | 0.136606 | 0.084775 | 0.031574 | 0.885845 | 1.497445 | 0.008646 | 0.240237 | NaN | 1.247085 | NaN | 238.290735 | low |
| 1 | 2019 | SGD | 0.000445 | NaN | NaN | 0.001045 | 0.001389 | 0.003970 | 0.002791 | 0.002026 | NaN | NaN | NaN | NaN | 0.438466 | NaN | NaN | 0.112058 | 0.111797 | 0.081475 | 0.001734 | 0.867937 | 0.037244 | 0.000000 | 0.031510 | NaN | 1.635299 | NaN | 5.122706 | low |
| 2 | 2019 | SGD | 0.001736 | NaN | NaN | 0.006162 | 0.000009 | 0.012100 | 0.009388 | 0.012075 | NaN | NaN | 0.004966 | NaN | 0.438356 | NaN | NaN | 0.112056 | 0.111672 | 0.079813 | 0.013339 | 0.800571 | 0.000246 | 0.031045 | 0.001239 | NaN | 0.715801 | NaN | 0.000000 | low |
| 158 | 2019 | SGD | 0.000689 | NaN | NaN | 0.001022 | 0.000876 | 0.009428 | 0.005806 | 0.002317 | NaN | NaN | NaN | NaN | 0.441063 | NaN | NaN | 0.113742 | 0.114150 | 0.079813 | 0.001677 | 0.916706 | 0.772724 | 0.000000 | 0.458568 | NaN | 2.998544 | NaN | 0.000000 | low |
| 159 | 2019 | USD | 0.002046 | NaN | 0.000268 | 0.000567 | 0.001116 | 0.009383 | 0.005858 | 0.001266 | NaN | NaN | NaN | NaN | 0.439226 | NaN | NaN | 0.112298 | 0.112135 | 0.080764 | 0.000675 | 0.985506 | 0.447498 | 0.000000 | 0.233019 | NaN | 5.429553 | NaN | 25.758333 | low |
def remove_duplicate(data):
'''
Description: Returns count of duplicate data
'''
data = data.drop_duplicates()
return data.loc[data.duplicated()].shape[0]
number_duplicate_rows = remove_duplicate(df)
df = df.drop_duplicates()
df.shape
(120, 30)
Out of 160 rows, 40 rows were duplicate rows sand hence removed
def missing_value(x):
'''
Description: Returns missing value number for each column
'''
count_missing = x.isnull().sum()
percent_missing = x.isnull().sum() * 100 / len(x)
missing_value_df = pd.DataFrame({'column_name': x.columns,
'count_missing':count_missing,
'percent_missing': percent_missing})
missing_value_df = missing_value_df[missing_value_df['count_missing']!=0]
missing_value_df.sort_values(by=['count_missing'],ascending=False, inplace = True)
return missing_value_df
missing_value(df)
| column_name | count_missing | percent_missing | |
|---|---|---|---|
| Unnamed: 25 | Unnamed: 25 | 120 | 100.000000 |
| Unnamed: 27 | Unnamed: 27 | 120 | 100.000000 |
| Goodwill | Goodwill | 114 | 95.000000 |
| Other Noncurrent Financial Liabilities | Other Noncurrent Financial Liabilities | 113 | 94.166667 |
| Other Current Financial Assets | Other Current Financial Assets | 112 | 93.333333 |
| Other Noncurrent Financial Assets | Other Noncurrent Financial Assets | 110 | 91.666667 |
| Other Current Financial Liabilities | Other Current Financial Liabilities | 83 | 69.166667 |
| Noncurrent Loans And Borrowings | Noncurrent Loans And Borrowings | 82 | 68.333333 |
| Intangible Assets Other Than Goodwill | Intangible Assets Other Than Goodwill | 76 | 63.333333 |
| Current Loans And Borrowings | Current Loans And Borrowings | 70 | 58.333333 |
| Finance Costs | Finance Costs | 25 | 20.833333 |
| Cash And Bank Balances | Cash And Bank Balances | 6 | 5.000000 |
| Trade And Other Receivables Current | Trade And Other Receivables Current | 6 | 5.000000 |
| Profit Loss | Profit Loss | 5 | 4.166667 |
Since this values are related to financial data, it is necessary to find correct imputation technique. For example, 95% of values of Goodwill column is missing so instead of dropping the column, it is better to impute with 0 as it is possible that other companies were not able to get Goodwill. Same stetegy has been applied to all the columns
# Dropping Unnamed: 25 and Unnamed: 27 right away
df.drop(['Unnamed: 25', 'Unnamed: 27'], axis=1, inplace=True)
df.shape
(120, 28)
df["Risk_bool"] = np.where(df["Risk"]=="low",0,1)
df.drop(['Risk'], axis=1, inplace=True)
profile = ProfileReport(df)
profile
Summarize dataset: 0%| | 0/5 [00:00<?, ?it/s]
Generate report structure: 0%| | 0/1 [00:00<?, ?it/s]
Render HTML: 0%| | 0/1 [00:00<?, ?it/s]